Establishment and validation of a nomogram model for risk prediction of hepatic encephalopathy: a retrospective analysis

This study aimed to establish a high-quality, easy-to-use, and effective risk prediction model for hepatic encephalopathy that helps healthcare professionals identify people at high risk of hepatic encephalopathy and guides early intervention to reduce its occurrence. Patients (n = 1178) with decompensated cirrhosis who were hospitalized and treated at the First Affiliated Hospital of Guangxi University of Chinese Medicine between January 2016 and June 2022 were included for the establishment and validation of a nomogram model for risk prediction of hepatic encephalopathy. We screened the risk factors for hepatic encephalopathy in patients with decompensated cirrhosis by univariate analysis, LASSO regression, and multifactor analysis, then established a nomogram model for predicting the risk of hepatic encephalopathy in these patients, and finally performed discrimination analysis, calibration analysis, clinical decision curve analysis, and validation of the established model. Based on the results of the univariate analysis, LASSO regression, and multifactor analysis, a final nomogram model was created with age, diabetes, ascites, spontaneous bacterial peritonitis, alanine aminotransferase, and blood potassium as predictors of hepatic encephalopathy risk. The AUC of the model was 0.738 (95% CI 0.63–0.746) in the training set and 0.667 (95% CI 0.541–0.706) in the validation set; by the criterion used in this study (AUC > 0.7), discrimination was good in the training set and lower in the validation set. According to the cut-off value determined by the Youden index, when the cut-off value of the training set was set at 0.150, the sensitivity of the model was 72.8%, the specificity was 64.8%, the positive predictive value was 30.4%, and the negative predictive value was 91.9%; when the cut-off value of the validation set was set at 0.141, the sensitivity was 69.7%, the specificity was 57.3%, the positive predictive value was 34.5%, and the negative predictive value was 84.7%. The calibration curves lay close to the diagonal, indicating small prediction error. The Hosmer–Lemeshow goodness-of-fit test gave χ² = 1.237587, P = 0.998 for the training set and χ² = 31.90904, P = 0.0202 for the validation set, indicating no significant difference between the predicted and observed values in the training set, whereas the validation-set P value fell below 0.05. Clinical decision curve analysis showed that the model had a good clinical benefit compared with the two extreme clinical scenarios (all patients treated or none treated), in both the training set and the validation set. This study showed that age over 55 years, diabetes, ascites, spontaneous bacterial peritonitis, abnormal alanine aminotransferase, and abnormal blood potassium are independent risk indicators for the development of hepatic encephalopathy in patients with decompensated cirrhosis. The nomogram model based on these indicators can conveniently predict the risk of developing hepatic encephalopathy in patients with decompensated cirrhosis and can help clinical healthcare professionals identify high-risk patients early.


Statistical analysis
Data grouping
Random numbers were generated using R software, and the patients with decompensated cirrhosis included in the study were randomly divided into a training set (70%) and a validation set (30%). The training set was used to construct the risk prediction model, and the validation set was used to verify the accuracy of the prediction model.
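The random 70/30 grouping described above can be illustrated with a minimal Python sketch. It is not the R code used in the study: the function name `split_train_validation`, the fixed seed, and the use of consecutive integers as patient identifiers are all assumptions made for illustration, and the resulting set sizes need not match those reported by the authors.

```python
import random


def split_train_validation(patient_ids, train_fraction=0.7, seed=2022):
    """Randomly partition patient IDs into a training and a validation set.

    The seed is an illustrative assumption; it only makes the split repeatable.
    """
    rng = random.Random(seed)
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical identifiers 1..1178 standing in for the included patients.
train_ids, validation_ids = split_train_validation(range(1, 1179))
print(len(train_ids), len(validation_ids))
```

Every patient lands in exactly one of the two sets, so the validation set plays no part in model fitting.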

Statistical descriptions
Difference analysis was performed for the 77 selected relevant indicators in the training and validation sets. Between-group comparisons were made between patients in the hepatic encephalopathy and non-hepatic encephalopathy groups. Continuous variables, such as white blood cell count (WBC), were compared between groups using independent-samples t tests or Wilcoxon rank sum tests and expressed as mean ± SD or median (P25, P75). Categorical variables, such as smoking history, hypertension, diabetes mellitus, gastrointestinal bleeding, ascites, and cardiovascular disease, were compared between groups using Pearson's chi-square test or Fisher's exact test and expressed as frequencies (percentages). Statistical significance was set at P < 0.05.

Handling of missing values
Multiple imputation of missing data was performed using SPSS software. Most traditional methods for handling missing values fill them with the median or mean. Multiple imputation instead uses the other variables in the dataset: it fits the missing values through iteration and models built on a pre-defined matrix, and then uses the fitted predictions to fill each missing value of the variable several times. This gives more accurate substitutes for the missing values.

Model establishment and demonstration
(1) Determination of independent risk factors. Least absolute shrinkage and selection operator (LASSO) regression was performed using the "glmnet" package of R software. LASSO regression is a linear regression that avoids overfitting by imposing a penalty on the magnitude of the model coefficients. Some of the variables selected by LASSO regression might not be significantly associated with the outcome in the multifactor logistic regression analysis.
(2) Establishment of the model. After the predictor variables had been screened by LASSO regression, the variables with P < 0.1 were used as predictors, and the risk prediction model was constructed by binary logistic regression using the glm function in R software.
(3) Presentation of the model. To visualize the weight of each predictor and to make the risk prediction model more convenient and concise for clinical application, the "rms" package of R software was used to build a nomogram from the results of the multifactor logistic model, using the lrm and nomogram functions.

Evaluation and validation of the model
The risk prediction model for hepatic encephalopathy in patients with decompensated cirrhosis, built on the training set, was evaluated in terms of its discrimination, calibration, and clinical benefit. The model was then validated in the validation set.
(1) Evaluation of the model. The receiver operating characteristic (ROC) curve, area under the curve (AUC), concordance index (C-index), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were used to evaluate the risk prediction model. The area under the ROC curve reflects the discriminative power of the model. The model was considered to have good discriminatory ability if the area under the ROC curve was greater than 0.7; when the area under the ROC curve was close or equal to 0.5, the model was considered to have low diagnostic value.
(2) Validation of the model. We used the bootstrap resampling method, the Hosmer–Lemeshow test, the ROC curve, the area under the ROC curve, and the calibration curve to measure and validate the model. The clinical benefit of the model was evaluated using decision curve analysis.
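To make the meaning of the area under the ROC curve concrete, the following Python sketch computes it directly from its probabilistic definition: the chance that a randomly chosen patient with the outcome receives a higher predicted risk than a randomly chosen patient without it, with ties counted as one half. For a binary outcome this quantity equals the C-index. The function name `auc` and the toy labels and scores are hypothetical and are not taken from the study data.

```python
def auc(labels, scores):
    """Area under the ROC curve from pairwise comparisons.

    labels: 1 for patients with the outcome, 0 for patients without it.
    scores: predicted risks in the same order as labels.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    non_cases = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for case_score in cases:
        for non_case_score in non_cases:
            if case_score > non_case_score:
                wins += 1.0
            elif case_score == non_case_score:
                wins += 0.5  # a tie counts as half a concordant pair
    return wins / (len(cases) * len(non_cases))


# Toy data (hypothetical): two cases and three non-cases.
toy_auc = auc([1, 1, 0, 0, 0], [0.9, 0.4, 0.5, 0.3, 0.2])
print(round(toy_auc, 3))
```

In the toy data five of the six case/non-case pairs are ranked correctly, so the value is 5/6. A model that ranks pairs no better than chance gives 0.5.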
This study used Excel software for data entry and SPSS 26.0 and R 3.6.3 software for statistical analysis. All P values were two-sided, and P < 0.05 was considered statistically significant unless otherwise stated.

Sample size estimation and general information of the included patients
We calculated the minimum sample size required for modeling in this study as 323 cases based on R² and 344 cases based on the C-index. In line with this estimate, 1178 patients with decompensated cirrhosis were finally included in the analysis, including 203 patients who developed hepatic encephalopathy within six months and 975 who did not. Of the 1178 patients, 849 (72.1%) were male and 329 (27.9%) were female; 585 (49.7%) were under 55 years of age and 593 (50.3%) were over 55 years of age; 195 (16.6%) had ascites, 410 (34.8%) had infection, and 193 (16.4%) had diabetes mellitus (Table 1).

Comparison of the training and validation sets
The 1178 patients were randomly divided into a training set and a validation set in a 70:30 ratio, giving n = 826 in the training set and n = 352 in the validation set (Table 2). The training set included 128 patients with hepatic encephalopathy, accounting for 15.5% of the set; the validation set included 75 patients with hepatic encephalopathy, accounting for 21.3% of the set. Statistical analysis of the general data showed that, except for the neutrophil ratio (NEUT), absolute neutrophil value (NEUP), immunoglobulin M (IgM), and α-L-fucosidase (AFU), the differences between the two sets were not statistically significant (P > 0.05) for any variable. This indicates that the indicators are evenly distributed between the training and validation sets, which helps to avoid biased conclusions.

Assignment of variables in the model for risk prediction of hepatic encephalopathy
One dependent variable (occurrence of hepatic encephalopathy) and 77 independent variables were included in this study. Table 4 lists the values assigned to each variable and the conversion of the corresponding variables to categorical variables, including the assignment of dichotomous variables and the dummy coding of multicategorical variables. The occurrence of HE was set as the dependent variable Y. The continuous independent variables (X), such as white blood cell count, platelet count, hemoglobin, glutamate aminotransferase, blood creatinine, blood urea nitrogen, urea, glucose, total cholesterol, triglycerides, total bile acids, and albumin, were still included in the model analysis as numerical variables.

Univariate logistic regression analysis
The occurrence of hepatic encephalopathy was used as the dependent variable Y, and all candidate predictors from the 826 patients in the training set were used as independent variables. Univariate logistic regression analysis was performed to screen the potential predictors. The following variables were identified as potential predictors and entered into the regression equation (Table 5): age, serum alkaline phosphatase (AKP), alanine aminotransferase (ALT), adenosine deaminase (ADA), aspartate aminotransferase (AST), diabetes mellitus, serum lactate dehydrogenase (LDH), apolipoprotein A, serum potassium, mean corpuscular volume (MCV), ascites, spontaneous bacterial peritonitis, and gastrointestinal bleeding.

Predictors screening
LASSO regression was performed using the "glmnet" package in R software. All independent variables were screened using LASSO regression, and the tuning parameter λ was selected by ten-fold cross-validation. LASSO shrinks the regression coefficients of weakly associated variables to exactly zero, which removes them from the model. Conversely, a non-zero regression coefficient indicates that the variable is associated with the occurrence of HE in patients with cirrhosis. The two dashed lines indicate lambda.min, the value of λ that gives the smallest cross-validated error, and lambda.1se, the value of λ for the most parsimonious model whose error is within one standard error of the minimum. At lambda.min, 15 variables had non-zero regression coefficients: age, sex, diabetes mellitus (DM), spontaneous bacterial peritonitis, ascites, gastrointestinal bleeding, serum α-hydroxybutyrate dehydrogenase (α-HBDH), white blood cells (WBC), mean corpuscular volume (MCV), serum potassium (K), prothrombin time (TT), serum alkaline phosphatase (AKP), alanine aminotransferase (ALT), adenosine deaminase (ADA), and plasma ammonia (Ammo). These 15 predictor variables were included in a multifactorial logistic regression analysis, and six of them showed statistically significant differences (P < 0.05) (Fig. 2A, B, Table 6).

Establishment and demonstration of the model
The risk prediction model for the occurrence of HE in patients with decompensated cirrhosis was established from the above six predictors (risk factors), and the OR values obtained from the model are shown in Fig. 3. The presence of SBP had the greatest effect on the occurrence of hepatic encephalopathy: the risk in decompensated cirrhosis patients with SBP was 4.856 times that in patients without SBP (2.66, 8.865). The risk in patients older than 55 years was 2.26 times that in patients not older than 55 years (1.461, 3.494). The risk in patients with a history of diabetes mellitus was 1.656 times that in patients without such a history (1.006, 2.725). The risk in patients with ascites was 2.025 times that in patients without ascites (1.26, 3.255). The risk also increased with increasing serum ALT values, with an OR of 1.005 (1.002, 1.007).
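An odds ratio from a logistic regression is the exponential of the fitted coefficient, and its confidence limits are the exponentials of the coefficient plus or minus a multiple of its standard error. The Python sketch below works backwards from the OR reported for SBP to an approximate coefficient and standard error, then rebuilds the interval. It assumes that the bracketed intervals are 95% confidence intervals computed with z = 1.96, which the text does not state; the function name `odds_ratio_with_ci` is hypothetical.

```python
import math


def odds_ratio_with_ci(beta, standard_error, z=1.96):
    """Convert a logistic regression coefficient to an OR with confidence limits."""
    return (
        math.exp(beta),
        math.exp(beta - z * standard_error),
        math.exp(beta + z * standard_error),
    )


# Values reported in the text for SBP: OR 4.856 (2.66, 8.865).
reported_or, reported_low, reported_high = 4.856, 2.66, 8.865

# Back-calculated on the log scale, assuming a symmetric 95% interval.
beta_sbp = math.log(reported_or)
se_sbp = (math.log(reported_high) - math.log(reported_low)) / (2 * 1.96)

or_sbp, low_sbp, high_sbp = odds_ratio_with_ci(beta_sbp, se_sbp)
print(round(beta_sbp, 3), round(se_sbp, 3))
print(round(or_sbp, 3), round(low_sbp, 2), round(high_sbp, 2))
```

The rebuilt limits agree with the reported ones up to rounding, which is consistent with an interval that is symmetric on the log-odds scale.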
To visualize the weight of each predictor and to present the model for clinical application, we used R software to construct a nomogram. The scores and risks of each predictor are shown in Fig. 4 and Tables 7 and 8. The higher the total score, the higher the risk of developing hepatic encephalopathy.
The following example (see Fig. 5) illustrates the clinical application of the nomogram model. A patient with decompensated cirrhosis is 60 years old (6 points), has diabetes mellitus (4 points) and ascites (6 points), does not have spontaneous bacterial peritonitis (0 points), and has laboratory results showing an ALT of 200 U/L (7 points) and a serum potassium concentration of 3.0 mmol/L (14 points). This patient has a total score of 37, and the probability of developing hepatic encephalopathy is greater than 0.6. In other words, this patient is at high risk of developing hepatic encephalopathy, which suggests the need for timely and early intervention by medical personnel to reduce that risk.
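The arithmetic of the worked example can be checked with a few lines of Python. The point values are those quoted in the text; the dictionary keys and the function name `total_points` are illustrative. The conversion from total points to predicted probability is read from the nomogram (Table 7) and is not reproduced here.

```python
# Points for the example patient, as quoted in the text.
example_points = {
    "age 60 years": 6,
    "diabetes mellitus": 4,
    "ascites": 6,
    "no spontaneous bacterial peritonitis": 0,
    "ALT 200 U/L": 7,
    "serum potassium 3.0 mmol/L": 14,
}


def total_points(points):
    """Sum the nomogram points over all predictors."""
    return sum(points.values())


print(total_points(example_points))  # 37
```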

Evaluation and validation of the model
Discrimination
The discrimination of the model was evaluated using receiver operating characteristic (ROC) curves (Fig. 6A, B). The AUC of the model was 0.738 (95% CI 0.63–0.746) in the training set (Fig. 6A) and 0.667 (95% CI 0.541–0.706) in the validation set (Fig. 6B). By the criterion stated above (AUC > 0.7), the nomogram model discriminated well in the training set and less well in the validation set. According to the cut-off value determined by the Youden index, when the cut-off value of the training set was taken as 0.150, the sensitivity of the model was 72.8%, the specificity was 64.8%, the PPV was 30.4%, and the NPV was 91.9%; when the cut-off value of the validation set was taken as 0.141, the sensitivity was 69.7%, the specificity was 57.3%, the PPV was 34.5%, and the NPV was 84.7%.
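The Youden index is sensitivity + specificity − 1, and the cut-off is the predicted risk at which this index is largest. The Python sketch below shows how the cut-off and the four reported metrics relate to each other. The function names and the toy data are hypothetical; the block does not reproduce the study's figures.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(labels, probs, cutoff):
    """Metrics when a predicted risk >= cutoff is treated as a positive prediction."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, probs):
        if p >= cutoff:
            if y == 1:
                tp += 1
            else:
                fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }


def youden_cutoff(labels, probs):
    """Return the cut-off maximizing sensitivity + specificity - 1, and that maximum."""
    best_cutoff, best_j = None, float("-inf")
    for cutoff in sorted(set(probs)):
        m = classification_metrics(labels, probs, cutoff)
        j = m["sensitivity"] + m["specificity"] - 1.0
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Toy data (hypothetical): three patients with the outcome, five without.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
toy_probs = [0.40, 0.22, 0.10, 0.30, 0.12, 0.08, 0.05, 0.03]
cutoff, j = youden_cutoff(toy_labels, toy_probs)
print(cutoff, classification_metrics(toy_labels, toy_probs, cutoff))
```

With a low outcome prevalence, a cut-off chosen this way tends to give a modest PPV and a high NPV, the same pattern as in the values reported above.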

Calibration
The bootstrap sampling method was used to perform the calibration. Patients in the training and validation sets were each resampled 1000 times, and the calibration curves were plotted after validation. The horizontal axis indicates the predicted probability of developing hepatic encephalopathy in patients with decompensated cirrhosis, and the vertical axis indicates the actual event occurrence. The further the calibration curve deviates from the diagonal, the greater the error (Fig. 7A, B). The Hosmer–Lemeshow goodness-of-fit test was also applied and gave χ² = 1.237587, P = 0.998 in the training set and χ² = 31.90904, P = 0.0202 in the validation set. This indicates no significant difference between the predicted and observed values in the training set, whereas the P value in the validation set was below 0.05, suggesting some deviation between predicted and observed values in that set.
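The Hosmer–Lemeshow statistic compares observed and expected event counts within groups of patients ordered by predicted risk. The Python sketch below follows the conventional form of the test under assumptions that the text does not state: ten groups of roughly equal size and a chi-square reference distribution with (number of groups − 2) degrees of freedom. The function names are hypothetical, and the closed-form tail probability used here is valid only for an even number of degrees of freedom.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df (closed form)."""
    if df < 2 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


def hosmer_lemeshow(labels, probs, groups=10):
    """Return (chi-square statistic, degrees of freedom, P value)."""
    pairs = sorted(zip(probs, labels))  # order patients by predicted risk
    n = len(pairs)
    if n < groups:
        raise ValueError("need at least one patient per group")
    chi2 = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        observed = sum(y for _, y in chunk)
        expected = sum(p for p, _ in chunk)
        # Variance term n_g * mean_p * (1 - mean_p), written with expected = n_g * mean_p.
        variance = expected * (1.0 - expected / size)
        if variance > 0:
            chi2 += (observed - expected) ** 2 / variance
    df = groups - 2
    return chi2, df, chi2_sf_even_df(chi2, df)


# Toy data (hypothetical): observed counts match expected counts in every group.
cal_probs = [0.2] * 5 + [0.4] * 5 + [0.6] * 5 + [0.8] * 5
cal_labels = [1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0]
print(hosmer_lemeshow(cal_labels, cal_probs, groups=4))
```

A small statistic with a large P value means the observed and predicted event counts agree within the groups; a P value below 0.05 points to a lack of fit.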

Clinical decision curve analysis
We used clinical decision curve analysis (DCA) to assess the net benefit of the model in clinical application. As shown in Fig. 8A, B, the DCA results indicate that the model has good clinical benefit in both the training and validation sets when compared with the two extreme clinical scenarios (all patients treated or none treated).
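Decision curve analysis plots net benefit against the threshold probability at which a clinician would choose to intervene. At threshold p_t, the net benefit of a model is TP/n − (FP/n) × p_t/(1 − p_t); treating nobody has a net benefit of 0, and treating everybody is the special case in which every patient is classified as positive. The Python sketch below evaluates these quantities at a single threshold. The function names and toy data are hypothetical and do not reproduce the study's curves.

```python
def net_benefit(labels, probs, threshold):
    """Net benefit of treating patients whose predicted risk is >= threshold."""
    n = len(labels)
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1.0 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Net benefit of treating every patient regardless of predicted risk."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Toy data (hypothetical): three patients with the outcome, five without.
dca_labels = [1, 1, 1, 0, 0, 0, 0, 0]
dca_probs = [0.40, 0.22, 0.10, 0.30, 0.12, 0.08, 0.05, 0.03]
nb_model = net_benefit(dca_labels, dca_probs, 0.10)
nb_all = net_benefit_treat_all(dca_labels, 0.10)
print(round(nb_model, 4), round(nb_all, 4), 0.0)  # model, treat all, treat none
```

A model is clinically useful at a given threshold when its net benefit exceeds both reference strategies; repeating the calculation over a range of thresholds traces out the decision curve.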

Discussion
Hepatic encephalopathy is a complex disease with a wide range of etiologies and varying degrees of severity. Using appropriate measurement tools to assess the risk of hepatic encephalopathy can help to develop targeted interventions that reduce its occurrence, which is important for improving patients' quality of life and reducing the burden of medical care. The development of high-quality risk prediction tools has therefore become a focus of research on the prevention and treatment of hepatic encephalopathy. In recent years, scholars in various countries have constructed risk prediction models for hepatic encephalopathy based on the characteristics of the local population and epidemiological data. However, these models are diverse: the predictors incorporated in each model are not consistent, and the assessment contents and applicable populations are not uniform, resulting in a gap between the predictions and the real situation. Risk prediction model studies aim to estimate the probability of an event occurring in an individual and can be divided into diagnostic models (presence or absence of a disease or symptom) and prognostic models (whether a specific outcome will occur in the future) [2]. The common metrics used to evaluate predictive models are discrimination and calibration; good discrimination indicates that the model can accurately distinguish between individuals at different levels of risk. An AUC of 0.50 indicates that the model has no discriminative power, 0.51–0.70 indicates low discrimination, 0.71–0.90 indicates good discrimination, and higher than 0.90 indicates high discrimination [3]. Sensitivity, also known as the true positive rate, reflects the ability to correctly detect patients who have the disease, and specificity, also known as the true negative rate, reflects the ability to correctly identify people who are disease-free [4].
Risk prediction models can be divided into traditional statistical models and machine learning models according to the model-building method. Traditional statistical models are mathematical models based on statistical analysis of risk factors: the probability of disease occurrence is calculated from a model in which factors that independently predict the event are selected as predictors. The most common are logistic regression and Cox proportional hazards regression models. Takikawa [5] applied logistic regression analysis to construct a predictive model for the risk of developing hepatic encephalopathy, and the findings suggest that advanced age, prolonged prothrombin time, and high total serum bilirubin can be used as risk predictors for the development of hepatic encephalopathy.
Although the specificity in that study was very high, its sensitivity was low, indicating that the above factors alone were not sufficient to predict the development of hepatic encephalopathy. In 2019, Labenz [6] used history of minimal hepatic encephalopathy, history of hepatic encephalopathy, C-reactive protein, albumin, MELD score, and serum interleukin 6 (IL-6) as predictors to establish a prediction model and validate the value of IL-6 in identifying the occurrence of hepatic encephalopathy; the results showed that the predictive performance was substantially improved (AUC of 0.931).
In contrast to the logistic regression model, the Cox proportional hazards regression model uses survival outcome and survival time as dependent variables, allowing simultaneous analysis of the effects of numerous factors on survival to study the incidence at different time points. Tapper [7] used demographic, clinical, laboratory, and pharmacological data to construct a predictive model for the risk of developing hepatic encephalopathy based
In this study, we chose logistic regression rather than the Cox proportional hazards model for several reasons. (1) The purpose of our study was to examine the impact of specific risk factors on a dichotomous outcome variable (whether or not a specific event occurs) rather than on survival time. We therefore considered logistic regression more appropriate for our study, whereas the Cox proportional hazards model is more appropriate for survival analysis. (2) Our dataset did not contain information on survival times, nor did we record the start and end times of observation for individuals, so it was difficult to perform analyses using the Cox proportional hazards model. (3) Some of the independent variables in our dataset are categorical or ordinal, whereas the Cox proportional hazards model requires the independent variables to be continuous or dichotomous. Converting these variables to dichotomous variables may lose some information and precision. Logistic regression, on the other hand, can handle multicategorical or ordinal variables and only requires dummy variable coding [8,9]. (4) Some independent variables in our dataset may not meet the basic assumption of the Cox model, the proportional hazards assumption, which is violated when the impact of an independent variable on the outcome changes over time. If the proportional hazards assumption does not hold, the results of the Cox model lose their explanatory power. Logistic regression, on the other hand, does not require this assumption and is more flexible and robust.
The results of this study suggested that age, diabetes mellitus (DM), ascites, spontaneous bacterial peritonitis (SBP), abnormal alanine aminotransferase (ALT), and abnormal blood potassium (K) are risk factors (predictors) for the development of hepatic encephalopathy. These predictors were entered into the subsequent analysis and used to build the nomogram model.
We included both potassium and sodium among the possible risk factors. After statistical analysis, the difference in potassium was statistically significant (P < 0.05), while the difference in sodium was not (P > 0.05), so we included potassium as an independent factor in the nomogram model for risk prediction of hepatic encephalopathy. Potassium is the major intracellular cation and is involved in maintaining electrolyte homeostasis and acid-base balance inside and outside the cell. Hypokalemia is defined as a serum potassium concentration below 3.5 mmol/L and is commonly found in patients with liver cirrhosis, especially when combined with ascites, diuresis, vomiting, and diarrhea [30]. Hypokalemia may lead to metabolic alkalosis, which in turn promotes the occurrence of hepatic encephalopathy. Sodium, on the other hand, is the major extracellular cation and is involved in maintaining body fluid volume and osmolality. The impact of blood sodium abnormalities on hepatic encephalopathy is unclear. A number of relevant studies support the rationale for choosing potassium rather than sodium as an independent factor in the development of hepatic encephalopathy in cirrhosis [30][31][32][33][34].
In this study, the ROC curve was used to evaluate the predictive ability of the model, and the area under the ROC curve was calculated to evaluate model performance. The accuracy of the model was evaluated by plotting the calibration curve. The clinical benefit of the model was evaluated using decision curve analysis (DCA), a method that evaluates prediction models by calculating the net clinical benefit. The DCA results showed that the risk prediction model established in this study had good clinical benefit in both the training set and the validation set when compared with the two extreme clinical scenarios (all patients treated or none treated). This further supported the performance and value of this model in practical clinical work.
This study has some limitations. It is a single-center study, so the sample size and the representativeness of the sample might be insufficient. Our study covered 2016 to 2022; we initially collected 1550 patients, and after rigorous screening against the inclusion and exclusion criteria the final sample size was 1178, much larger than the minimum sample size required for constructing a risk prediction model (323 patients) [35]. We therefore considered that the results of our study can be applied in clinical practice. Although the prediction model was internally validated with a validation set, external validation with a larger, multicenter sample would help to demonstrate the feasibility of the model and support its generalization. In future studies, we intend to conduct multicenter investigations to improve the representativeness of the sample and the applicability of the results.
In conclusion, this study showed that age over 55 years, diabetes, ascites, spontaneous bacterial peritonitis, abnormal alanine aminotransferase, and abnormal blood potassium concentration are independent risk factors (predictors) for the development of hepatic encephalopathy in patients with decompensated cirrhosis, and these six indicators are meaningful for identifying the risk of developing hepatic encephalopathy in these patients. The nomogram model based on these risk factors can conveniently predict the risk of developing hepatic encephalopathy in patients with decompensated cirrhosis. It can help clinical healthcare professionals identify patients at high risk of hepatic encephalopathy early, so that they can intervene in time and prevent disease progression.

Figure 1. Flow diagram of the study.

Figure 6. ROC curves of the training set (A) and validation set (B). Note: The area under the ROC curve for the model is 0.738 and 0.667 for the training and validation sets, respectively.

Figure 7. Calibration curves of the training set (A) and validation set (B). Note: The X-axis is the predicted probability of developing hepatic encephalopathy in patients with decompensated cirrhosis, and the Y-axis is the actual probability of developing hepatic encephalopathy in patients with decompensated cirrhosis. The diagonal dashed line indicates a perfect prediction, while the solid line indicates the actual corrected prediction.

Figure 8. Decision curves of the training set (A) and validation set (B). Note: The horizontal line indicates that no patient received treatment after the application of the model, with a net benefit of 0. The diagonal line indicates that all patients received treatment.

Table 2. Baseline table for training set-validation set grouping.

Table 3. Analysis of differences between the hepatic encephalopathy and non-hepatic encephalopathy groups.

Table 5. Results of univariate logistic regression analysis.

Table 6. Multifactorial logistic regression analysis based on LASSO regression.

Table 7. Total scores of the predictors and their corresponding probability of diagnosis of hepatic encephalopathy.

Table 8. Scores for each predictor in the nomogram. AGE: age; DM: history of diabetes mellitus; SBP: spontaneous bacterial peritonitis; ALT: alanine aminotransferase; K: serum potassium concentration; A: assignment; S: score.